[Updated: Thu, Jun 03, 2021 - 19:11:14 ]
This is a crash course on Optimization for students in the social sciences. This document does not replace a course on Calculus or Optimization; it only serves to develop some vocabulary you will likely encounter in any Machine Learning course or textbook. If you have taken Calculus before, it will refresh your memory.
A function is a rule that assigns an input number from a defined domain to an output number in another defined domain. The domain is the set of all possible input values, and the codomain is the set of possible output values. For instance, \[f: \mathbb{R} \rightarrow \mathbb{R},\] indicates that \(f\) takes real numbers as input and produces real numbers as output. When the domain is not explicitly defined, we typically assume it is the largest set of numbers to which the rule can be applied.
For instance, the following is a function:
\[ f(x) = 5x^2 - 2x -12\] Below are the outputs of this function for a series of input values.
| \(x\) | \(f(x)\) |
|---|---|
| -3 | 39 |
| -2 | 12 |
| -1 | -5 |
| 0 | -12 |
| 1 | -9 |
| 2 | 4 |
| 3 | 27 |
Some functions are constant. No matter what the input is, they always produce the same output.
\[f(x) = a_0,\]
where \(a_0\) is a constant number.
f <- function(x) {3}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Some functions always produce an output identical to the input.
\[f(x) = x\]
f <- function(x) {x}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Some functions are inverses of other functions. Consider the following function:
\[ f(x) = 3x + 5\] For an input value of 2, this function returns a value of 11. If we want to find the inverse of this function, we need something that takes the input value of 11 and returns a value of 2. The following is the inverse of this function:
\[ f^{-1}(x) = \frac{x-5}{3}\]
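We can check this pair numerically in base R (a quick sanity check, not from the original text; the input 7 is an arbitrary extra example):

```r
f     <- function(x) {3*x + 5}
f_inv <- function(x) {(x - 5)/3}

f(2)         # 11
f_inv(11)    # 2
f_inv(f(7))  # recovers 7: applying the inverse undoes the function
```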
Polynomial functions are in the form of
\[ f(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + ... + a_nx^n,\]
where \(a_0\), \(a_1\), \(a_2\), \(a_3\),…, \(a_n\) are coefficients, some of which may be zero. The largest power of \(x\) in the function is called the order of the polynomial.
Below is an example of a first-order polynomial function, which is a straight line.
\[f(x) = 10x + 5\]
f <- function(x) {10*x + 5}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Below is an example of a second-order polynomial function.
\[f(x) = 3x^2 + 10x + 5\]
f <- function(x) {3*x^2 + 10*x + 5}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Exponential functions are of the form
\[ f(x) = a_0a_1^x,\]
where \(a_0\) and \(a_1\) are coefficients.
Below is an example of an exponential function.
\[f(x) = (\frac{1}{3})^x \]
f <- function(x) {(1/3)^x}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
xlab('x')+
ylab('f(x)')+
theme_bw()
A common exponential function uses \(e\) as its base. \(e\) is a mathematical constant just like \(\pi\) and equals 2.718282… In R, you can use exp(x) to specify \(e^x\). Below is the graph of \(e^x\).
f <- function(x) {exp(x)}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Logarithmic functions are the inverses of exponential functions. The expression \[ x = b^y\] can be written as \[y = \log_b(x).\] \(b\) is called the base. In R, you can use the log() function to take logarithms.
See the following examples:
[1] 2
[1] 4
The most commonly used base is \(e \approx 2.718\); when the base is \(e\), the logarithm is called the natural logarithm.
[1] 2.302585
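The code chunks that produced the three outputs above are not shown in the text. One plausible reconstruction (the specific inputs 100, 16, and 10 are my guesses that match the printed values):

```r
log(100, base = 10)  # [1] 2
log(16, base = 2)    # [1] 4
log(10)              # base defaults to e, so this is the natural log: [1] 2.302585
```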
Some functions are bounded such that the output values are restricted within a range. Consider the following function.
\[f(x) = x^2 + 5 \]
This function can produce output values only within the range \([5,\infty)\).
f <- function(x) {x^2 + 5}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
ylim(0,30)+
xlab('x')+
ylab('f(x)')+
theme_bw()
An important bounded function is the sigmoid function (a.k.a., logistic function).
\[f(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{1 + e^x} \] The sigmoid function is always bounded between 0 and 1, and produces an S-shaped curve.
f <- function(x) {1/(1+exp(-x))}
ggplot() +
geom_function(fun = f) +
xlim(-5,5)+
ylim(0,1)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Trigonometric functions are defined using the unit circle, a circle centered at the origin with a radius of 1. An angle always creates a triangle inside the circle. The cosine function takes the magnitude of the angle as input and returns the signed length of the side of the triangle adjacent to the angle, while the sine function returns the signed length of the opposite side. The tangent function returns the ratio of the signed length of the opposite side to the signed length of the adjacent side.
require(ggforce)
ggplot() +
geom_circle(aes(x0=0,y0=0,r=1))+
geom_segment(aes(x=0,y=-1,xend=0,yend=1),lty=2)+
geom_segment(aes(x=-1,y=0,xend=1,yend=0),lty=2)+
geom_segment(aes(x=0,y=0,xend=sqrt(.5),yend=sqrt(.5)))+
geom_segment(aes(x=sqrt(.5),y=0,xend=sqrt(.5),yend=sqrt(.5)))+
geom_segment(aes(x=0,y=0,xend=sqrt(.5),yend=0))+
geom_curve(aes(x = 0.3, y = 0, xend = .2, yend = 0.2),
color = "black",
arrow = arrow(type = "closed",length=unit(0.1, "inches")))+
xlim(-2,2)+
ylim(-2,2)+
xlab('')+
ylab('')+
theme_bw() +
annotate('text',x=.4,y=-.1,label='cos(x)')+
annotate('text',x=.82,y=.3,label='sin(x)')+
annotate('text',x=.4,y=.5,label='1')
In R, you can use the cos(), sin(), and tan() commands to calculate the value of cosine, sine, and tangent for a given angle. Note that an angle can be specified in degrees or radians, and these functions take radians as input. Below is the simple formula to convert degrees to radians: \[\text{radians} = \text{degrees} \times \frac{\pi}{180}\]
For instance, the code below calculates the sine, cosine, and tangent of a 45 degree angle.
[1] 0.7071068
[1] 0.7071068
[1] 1
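The outputs above can be reproduced in base R by converting degrees to radians first (the helper name deg2rad is mine, not from the original code):

```r
deg2rad <- function(deg) {deg * pi / 180}

sin(deg2rad(45))  # [1] 0.7071068
cos(deg2rad(45))  # [1] 0.7071068
tan(deg2rad(45))  # [1] 1
```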
The cosine and sine functions always return values between -1 and 1, while the tangent can take values from \(-\infty\) to \(\infty\).
f <- function(x) {cos(pi*(x/180))}
ggplot() +
geom_function(fun = f) +
xlim(-720,720)+
ylim(-1,1)+
xlab('Degree')+
ylab('cos(x)')+
theme_bw()
f <- function(x) {sin(pi*(x/180))}
ggplot() +
geom_function(fun = f) +
xlim(-720,720)+
ylim(-1,1)+
xlab('Degree')+
ylab('sin(x)')+
theme_bw()
f <- function(x) {tan(pi*(x/180))}
ggplot() +
geom_function(fun = f) +
geom_vline(xintercept = -270,lty=2)+
geom_vline(xintercept = -90,lty=2)+
geom_vline(xintercept = 90,lty=2)+
geom_vline(xintercept = 270,lty=2)+
xlim(-360,360)+
ylim(-10,10)+
xlab('Degree')+
ylab('tan(x)')+
theme_bw()
Functions don’t have to have a single input variable; some functions may have multiple inputs. For instance, below is an example of a function with two inputs.
\[ f(x,y) = x^2 + 2y - 5\] Below is how this function looks in 3D. The x-axis and y-axis represent the input values, and the z-axis represents the output value based on the rule above. Note that most functions we will deal with in machine learning are high dimensional, with many input variables.
First, let’s define the difference quotient. The difference quotient is defined as the change in a function’s output value divided by the change in the function’s input value. Suppose we have a function \(f(x)\) and find the output value for two input values, \(x_1\) and \(x_2\).
\[y_1 = f(x_1)\] \[y_2 = f(x_2)\] Then, the difference quotient is
\[ \frac{y_2 - y_1}{x_2 - x_1} = \frac{\Delta y}{\Delta x} = \frac{f(x_2) - f(x_1)}{x_2 - x_1},\]
where \(\Delta\) means change.
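The definition translates directly into a small helper function (this helper and the example function \(g(x) = x^2\) are illustrative, not from the original text):

```r
# Difference quotient: (f(x2) - f(x1)) / (x2 - x1)
diff_quotient <- function(f, x1, x2) {
  (f(x2) - f(x1)) / (x2 - x1)
}

g <- function(x) {x^2}
diff_quotient(g, 1, 3)  # (9 - 1) / (3 - 1) = 4
```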
Let’s see an example. Consider the following function:
\[ f(x) = y = 5x^2 - 3x + 10.\] If we compute the output value for \(x_1 = 1\) and \(x_2 = 5\), the difference quotient is
\[\frac{\Delta y}{\Delta x} = \frac{f(5) - f(1)}{5 - 1} = \frac{120 - 12}{5 - 1} = 27.\]
If you look at this visually, the difference quotient gives the slope of the line that connects the two points (1,12) and (5,120). This line is called the secant line.
f <- function(x) {5*x^2 -3*x + 10}
ggplot() +
geom_function(fun = f) +
geom_point(aes(x=1,y=12),size = 3) +
geom_point(aes(x=5,y=120),size=3) +
geom_abline(slope=27,intercept = -15,lty=2)+
xlim(-6,6)+
ylim(0,130)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Now, consider that the change in \(x\) is smaller. Instead of moving \(x\) from 1 to 5, let’s move \(x\) from 1 to 3.
\[\frac{\Delta y}{\Delta x} = \frac{f(3) - f(1)}{3 - 1} = \frac{46 - 12}{3 - 1} = 17.\] The slope of the secant line, the line that connects (1,12) to (3,46), is now 17.
f <- function(x) {5*x^2 -3*x + 10}
ggplot() +
geom_function(fun = f) +
geom_point(aes(x=1,y=12),size=3) +
geom_point(aes(x=5,y=120),size=3) +
geom_abline(slope=27,intercept = -15,lty=2)+
geom_point(aes(x=3,y=46),size=3) +
geom_abline(slope=17,intercept = -5,lty=2)+
xlim(-6,6)+
ylim(0,130)+
xlab('x')+
ylab('f(x)')+
theme_bw()
Now, consider that the change in \(x\) is such a tiny number that it is not even noticeable, such as 0.00001. So, let’s move \(x\) from 1 to 1.00001. The slope becomes 7.00005. In this case, the secant line becomes a line that barely touches the function at \(x=1\).
\[\frac{\Delta y}{\Delta x} = \frac{f(1.00001) - f(1)}{1.00001 - 1} = 7.00005.\]
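The same computation in base R, with \(\Delta x = 0.00001\):

```r
f  <- function(x) {5*x^2 - 3*x + 10}
dx <- 0.00001

slope <- (f(1 + dx) - f(1)) / dx
slope  # approximately 7.00005
```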
f <- function(x) {5*x^2 -3*x + 10}
ggplot() +
geom_function(fun = f) +
geom_point(aes(x=1,y=12),size=3) +
geom_point(aes(x=5,y=120),size=3) +
geom_abline(slope=27,intercept = -15,lty=2)+
geom_point(aes(x=3,y=46),size=3) +
geom_abline(slope=17,intercept = -5,lty=2)+
geom_abline(slope=7,intercept = 5,lty=2)+
xlim(-6,6)+
ylim(0,130)+
xlab('x')+
ylab('f(x)')+
theme_bw()
The first-order derivative of a function is the limit of the difference quotient as the change in \(x\) approaches zero. In other words, when the change in \(x\) becomes infinitely small, the secant line becomes a line that barely touches the function at \(x_1\), called the tangent line. The first-order derivative of a function tells us the slope of this tangent line.
The first-order derivative of a function is denoted as \(\frac{dy}{dx}\) and more formally defined as the following:
\[\frac{df(x)}{dx} = \frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{f(x_1 + \Delta x) - f(x_1)}{\Delta x}\]
Most of the time you can numerically approximate the first-order derivative of a function by selecting a very small number for \(\Delta x\) (e.g., .00001), as we did above, and this works no matter how complex the function is. You can also analytically find the first-order derivative of a function, but this may become tedious as the function becomes more complex; you have to be familiar with a number of derivative rules and often be creative when finding a solution. In some cases, there may not even be an easy analytic solution, and numerical approximation is the easiest route.
You can get help from several tools while trying to find the first-order derivative of a function. There are online tools (e.g., https://www.symbolab.com/solver/derivative-calculator) you can use: you simply write the function and it returns the first-order derivative. In R, you can use the calculus package to find the derivatives of a function. For instance, see below for the example above.
require(calculus)
derivative(f = '5*x^2 -3*x + 10',
var = 'x',
order = 1)
# var = 'x', indicates that we are taking the derivative with respect to variable x
# order = 1, indicates that we are asking for the first derivative
[1] "5 * (2 * x) - 3"
This indicates that the first-order derivative of the function \[ f(x) = 5x^2 - 3x + 10 \] is \[ f'(x) = 10x - 3.\] Note that the first derivative of \(f(x)\) is also denoted as \(f'(x)\).
\(f'(x = x_1)\) gives you the slope of the tangent line that touches \(f(x)\) at \(x=x_1\). The slope of the tangent line at \(x=1\) for this function can be obtained directly from the first derivative of the function.
\[f'(x) = 10x - 3 \]
\[f'(1) = 10 \times 1 - 3 = 7 \] This is very close to what we computed above using a tiny change in \(x\), \(\Delta x = 0.00001\). The analytical solution is the exact one, but the numerical solution typically approximates it well enough for most applications.
You can also use the same derivative() function from the calculus package to get the numerical approximation for the first-order derivative of any function.
derivative(f = '5*x^2 -3*x + 10',
var = c(x=1),
order = 1)
# var = c(x=1), indicates that we are not interested in symbolic derivative
# function and want to obtain a numerical solution of the
# derivative function at x=1
[1] 7
Let’s find the first-order derivative at \(x=-1\).
[1] -13
This indicates that the value of the first-order derivative at \(x=-1\) is -13. Note that this is a negative number, so we understand that the slope of the tangent line that touches this function at \(x=-1\) is negative.
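The code chunk that produced the -13 is not shown in the text, but since we know \(f'(x) = 10x - 3\), the value is easy to confirm in base R:

```r
f       <- function(x) {5*x^2 - 3*x + 10}
f_prime <- function(x) {10*x - 3}

f_prime(-1)  # [1] -13
# The tangent line at x = -1 passes through (-1, f(-1)) = (-1, 18)
f(-1)        # [1] 18
```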
f <- function(x) {5*x^2 -3*x + 10}
ggplot() +
geom_function(fun = f) +
geom_point(aes(x=-1,y=18),size=3) +
geom_abline(slope= - 13,intercept = 5,lty=2)+
xlim(-6,6)+
ylim(0,130)+
xlab('x')+
ylab('f(x)')+
theme_bw()
As derivatives are also functions, we can always take a derivative of a derivative. If we take the derivative of a first-order derivative function, we obtain the second-order derivative of the original function. For instance, see below for the first-order and second-order derivatives of the function above.
derivative(f = '5*x^2 -3*x + 10',
var = 'x',
order = 1)
derivative(f = '5*x^2 -3*x + 10',
var = 'x',
order = 2)
[1] "5 * (2 * x) - 3"
[1] "5 * 2"
\[ f(x) = y = 5x^2 - 3x + 10.\] \[ f'(x) = 10x - 3.\] \[ f''(x) = 10.\] Below is an example of another function and its derivatives up to the fifth order.
derivative(f = '5*x^4 + 3*x^2 + 4*x + 5',
var = 'x',
order = 1)
derivative(f = '5*x^4 + 3*x^2 + 4*x + 5',
var = 'x',
order = 2)
derivative(f = '5*x^4 + 3*x^2 + 4*x + 5',
var = 'x',
order = 3)
derivative(f = '5*x^4 + 3*x^2 + 4*x + 5',
var = 'x',
order = 4)
derivative(f = '5*x^4 + 3*x^2 + 4*x + 5',
var = 'x',
order = 5)
[1] "5 * (4 * x^3) + 3 * (2 * x) + 4"
[1] "5 * (4 * (3 * x^2)) + 3 * 2"
[1] "5 * (4 * (3 * (2 * x)))"
[1] "5 * (4 * (3 * 2))"
[1] "0"
\[ f(x) = 5x^4 + 3x^2 + 4x + 5\]
\[ f'(x) = 20x^3 + 6x + 4\]
\[ f''(x) = 60x^2 + 6\]
\[ f'''(x) = 120x\]
\[ f''''(x) = 120\]
\[ f'''''(x) = 0\]
A partial derivative is a derivative of a multivariate function with respect to a particular input variable while treating all other inputs as constants. The examples up to this point considered functions with a single variable. Most of the time, we have to deal with functions of multiple variables, and they are sometimes highly complex. Consider the following function of two variables:
\[ f(x,y) = - \sqrt{(x-1)^2 + y^2} \]
We can take the first-order derivative of this function with respect to \(x\) while treating \(y\) as a constant, or we can take the first-order derivative of this function with respect to \(y\) while treating \(x\) as a constant. Without knowing anything about the rules of taking derivatives, you can find them using the derivative() function from the calculus package.
[1] "-(0.5 * (2 * (x - 1) * ((x - 1)^2 + y^2)^-0.5))"
[1] "-(0.5 * (2 * y * ((x - 1)^2 + y^2)^-0.5))"
Below are the partial derivatives of this function with respect to \(x\) and \(y\).
\[ \frac{\partial f(x,y)}{\partial x} = \frac{1-x}{\sqrt{(x-1)^2 + y^2}}\]
\[ \frac{\partial f(x,y)}{\partial y} = \frac{-y}{\sqrt{(x-1)^2 + y^2}}\]
Note that we use the \(\partial\) symbol instead of \(d\) to indicate that these are partial derivatives.
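As a sanity check, we can compare the simplified partial derivatives against finite-difference approximations in base R (the test point (2, 3) and step size are arbitrary choices of mine):

```r
f <- function(x, y) {-sqrt((x - 1)^2 + y^2)}

# Simplified analytic partials
df_dx <- function(x, y) {(1 - x) / sqrt((x - 1)^2 + y^2)}
df_dy <- function(x, y) {-y / sqrt((x - 1)^2 + y^2)}

# Finite-difference approximations at (2, 3)
h <- 1e-6
num_dx <- (f(2 + h, 3) - f(2, 3)) / h
num_dy <- (f(2, 3 + h) - f(2, 3)) / h

abs(num_dx - df_dx(2, 3)) < 1e-4  # TRUE
abs(num_dy - df_dy(2, 3)) < 1e-4  # TRUE
```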
The first-order derivative of a single-variable function indicates the slope of a tangent line that touches that function at a particular point. Then, how do we interpret the partial derivatives? It is best to visualize this to get a sense of it. Below is how this function looks in 3D. I also added a contour plot to represent the same shape in 2D.
grid <- expand.grid(x=seq(-5,5,.1),y=seq(-5,5,.1))
grid$z <- -((grid[,1] - 1)^2 + grid[,2]^2)^.5
plot_ly(grid,
x = ~x,
y = ~y,
z = ~z,
marker = list(color = ~z,
colorscale = 'YlGnBu',
showscale = TRUE)) %>%
add_markers() %>%
layout(scene = list(xaxis=list(range = c(-5,5)),
yaxis=list(range = c(-5,5))))
grid <- expand.grid(x=seq(-5,5,.1),y=seq(-5,5,.1))
grid$z <- -((grid[,1] - 1)^2 + grid[,2]^2)^.5
plot_ly(grid,
x = ~x,
y = ~y,
z = ~z,
type='contour',
colors = 'YlGnBu',
showscale=TRUE,
reversescale =T)
Now, suppose that we dropped you by helicopter onto this surface and your coordinates are \(x = -2\) and \(y = 2\). Your goal is to hike to the top. The partial derivatives would help you find your direction to the top on your hike. The partial derivatives with respect to \(x\) and with respect to \(y\) are equal to 0.832 and -0.554, respectively.
\[ \frac{\partial f(x,y)}{\partial x} = \frac{1-(-2)}{\sqrt{(-2-1)^2 + 2^2}} = 0.8320503\]
\[ \frac{\partial f(x,y)}{\partial y} = \frac{-2}{\sqrt{(-2-1)^2 + 2^2}} = -0.5547002\] Or, you can find these values using the gradient() function from the calculus package.
[1] 0.8320503 -0.5547002
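The same two numbers can be reproduced in base R from the simplified partial derivatives, without any package (the function names df_dx and df_dy are mine):

```r
df_dx <- function(x, y) {(1 - x) / sqrt((x - 1)^2 + y^2)}
df_dy <- function(x, y) {-y / sqrt((x - 1)^2 + y^2)}

c(df_dx(-2, 2), df_dy(-2, 2))  # [1]  0.8320503 -0.5547002
```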
If we add the gradient values with respect to \(x\) and with respect to \(y\) to the original coordinates of \(x\) and \(y\), the new coordinate gives the direction with the largest slope at \((x,y)\) toward the maximum. In the graph below, the filled dot is (-2,2) and the open circle is (-2 + 0.832, 2 - 0.554). The direction from the filled dot to the open circle gives the largest slope toward the maximum (steepest ascent). Or, if you want to reach the maximum, it shows you the best next step.
The dimension of the gradient depends on the number of inputs of a function. Suppose \(f\) is a function with \(n\) inputs; then the gradient of this function is a column vector of length \(n\).
\[ \nabla f = f'(x_1,x_2,...,x_n) = \begin{bmatrix} \frac{\partial y} {\partial x_1} \\ \frac{\partial y} {\partial x_2} \\ \vdots \\ \frac{\partial y} {\partial x_n} \end{bmatrix} \]
a <- -2
b <- 2
gr <- gradient(f = '-sqrt((x-1)^2 + y^2)',
var = c(x=a,y=b))
grid <- expand.grid(x=seq(-5,5,.1),y=seq(-5,5,.1))
grid$z <- -((grid[,1] - 1)^2 + grid[,2]^2)^.5
plot_ly(grid,
x = ~x,
y = ~y,
z = ~z,
type='contour',
colors = 'YlGnBu',
showscale=TRUE,
reversescale =T) %>%
add_trace(x = a,
y = b,
type = 'scatter',
mode = 'markers',
marker = list(color = 'black',size=8),
showlegend = FALSE) %>%
add_trace(x = a+gr[1],
y = b+gr[2],
type = 'scatter',
mode = 'markers',
symbols = c('x'),
marker = list(color = 'black',size=8,symbol = 'circle-open'),
showlegend = FALSE) %>%
add_annotations(ax = a,
ay = b,
x = a + gr[1],
y = b + gr[2],
axref = 'x',
ayref = 'y',
xref = 'x',
yref = 'y',
text = '',
showlegend=FALSE)
Note. We assume here that we are searching for the maximum. If we are searching for a minimum, then the open circle would be (-2 - 0.832, 2 + 0.554) and would provide the direction with the largest slope toward the minimum (steepest descent).
If we take the first-order partial derivatives of a multivariate function and take derivatives again with respect to each variable, we obtain the second-order partial derivatives. The Hessian matrix is a way of organizing the second-order partial derivatives. For instance, consider the function used in the previous section.
\[ f(x,y) = - \sqrt{(x-1)^2 + y^2} \]
We found its first-order partial derivative with respect to \(x\) to be the following:
\[ \frac{\partial f(x,y)}{\partial x} = \frac{1-x}{\sqrt{(x-1)^2 + y^2}} \]
\(\frac{\partial f(x,y)}{\partial x}\) is a function itself, so we can again take its derivatives with respect to \(x\) and \(y\).
\[\frac{\partial^2 f(x,y)}{\partial x^2} = \frac{\partial}{\partial x} \left ( \frac{\partial f(x,y)}{\partial x} \right )\] \[\frac{\partial^2 f(x,y)}{\partial x \partial y} = \frac{\partial}{\partial y} \left ( \frac{\partial f(x,y)}{\partial x} \right )\]
[1] "-(1/sqrt((x - 1)^2 + y^2) + (1 - x) * (0.5 * (2 * (x - 1) * ((x - 1)^2 + y^2)^-0.5))/sqrt((x - 1)^2 + y^2)^2)"
[1] "-((1 - x) * (0.5 * (2 * y * ((x - 1)^2 + y^2)^-0.5))/sqrt((x - 1)^2 + y^2)^2)"
This gets pretty ugly! I simplified it for you.
\[\frac{\partial^2 f(x,y)}{\partial x^2} = -\dfrac{y^2}{\left(\left(x-1\right)^2+y^2\right)^\frac{3}{2}}\]
\[\frac{\partial^2 f(x,y)}{\partial x \partial y} = \dfrac{\left(x-1\right)y}{\left(y^2+\left(x-1\right)^2\right)^\frac{3}{2}}\]
We are not done yet! These are the derivatives of the first-order partial derivative with respect to \(x\). We can do the same thing for the first-order partial derivative with respect to \(y\).
\[\frac{\partial^2 f(x,y)}{\partial y^2} = \frac{\partial}{\partial y} \left ( \frac{\partial f(x,y)}{\partial y} \right )\]
\[\frac{\partial^2 f(x,y)}{\partial y \partial x} = \frac{\partial}{\partial x} \left ( \frac{\partial f(x,y)}{\partial y} \right )\]
[1] "-((-y) * (0.5 * (2 * (x - 1) * ((x - 1)^2 + y^2)^-0.5))/sqrt((x - 1)^2 + y^2)^2)"
[1] "-((1 - x) * (0.5 * (2 * y * ((x - 1)^2 + y^2)^-0.5))/sqrt((x - 1)^2 + y^2)^2)"
They are simplified for you.
\[\frac{\partial^2 f(x,y)}{\partial y^2} = -\dfrac{\left(x-1\right)^2}{\left(y^2+\left(x-1\right)^2\right)^\frac{3}{2}}\]
\[\frac{\partial^2 f(x,y)}{\partial y \partial x} = \dfrac{y\left(x-1\right)}{\left(\left(x-1\right)^2+y^2\right)^\frac{3}{2}}\] We can organize these second-order derivatives in a Hessian matrix. Since our function has only two input variables, the Hessian matrix will be a 2 x 2 matrix. For a function with \(n\) input variables, the Hessian matrix would be an \(n\) by \(n\) matrix.
\[ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f(x,y)}{\partial x^2} & \frac{\partial^2 f(x,y)}{\partial x \partial y} \\ \frac{\partial^2 f(x,y)}{\partial y \partial x} & \frac{\partial^2 f(x,y)}{\partial y^2} \end{bmatrix}\]
In the previous section, we calculated the gradient vector for \((x,y) = (-2,2)\). Now, let’s compute the Hessian matrix for the same \((x,y) = (-2,2)\) coordinate.
\[\frac{\partial^2 f(x,y)}{\partial x^2} = -\dfrac{2^2}{\left(\left(-2-1\right)^2+2^2\right)^\frac{3}{2}} = -0.08533849\]
\[\frac{\partial^2 f(x,y)}{\partial x \partial y} = \dfrac{\left(-2-1\right)2}{\left(2^2+\left(-2-1\right)^2\right)^\frac{3}{2}} = -0.1280077\]
\[\frac{\partial^2 f(x,y)}{\partial y^2} = -\dfrac{\left(-2-1\right)^2}{\left(2^2+\left(-2-1\right)^2\right)^\frac{3}{2}} = -0.1920116\]
\[\frac{\partial^2 f(x,y)}{\partial y \partial x} = \dfrac{2\left(-2-1\right)}{\left(\left(-2-1\right)^2+2^2\right)^\frac{3}{2}} = -0.1280077\]
\[ \mathbf{H} = \begin{bmatrix} -0.085 & -0.128 \\ -0.128 & -0.192 \end{bmatrix} \]
Luckily, we don’t have to do this tedious work ourselves. We can use the hessian() function from the calculus package to compute the Hessian matrix for any given function.
[,1] [,2]
[1,] -0.08533849 -0.1280077
[2,] -0.12800774 -0.1920116
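The hessian() call itself is not shown in the text; presumably something like hessian(f = '-sqrt((x-1)^2 + y^2)', var = c(x=-2, y=2)). We can also confirm the four entries from the simplified formulas in base R (the helper names below are mine):

```r
d <- function(x, y) {((x - 1)^2 + y^2)^1.5}

h_xx <- function(x, y) {-y^2 / d(x, y)}
h_xy <- function(x, y) {(x - 1) * y / d(x, y)}
h_yy <- function(x, y) {-(x - 1)^2 / d(x, y)}

matrix(c(h_xx(-2, 2), h_xy(-2, 2),
         h_xy(-2, 2), h_yy(-2, 2)), nrow = 2)
# matches the hessian() output above
```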
Optimization is the process of finding a minimum or maximum of a function. The concepts of gradient and Hessian can be used to find a minimum or maximum of a given function, although they have their limitations. Here, we will focus on two methods, steepest ascent and Newton’s algorithm, to give a sense of what the optimization process looks like.
Let’s consider the following function and its first-order derivatives (gradient vector). We would like to know the coordinates of the maximum. At which specific coordinates does this function reach its maximum value? By looking at the plot, we can guess that it is somewhere between 0 and 2 on the x-axis and somewhere between -1 and 1 on the y-axis. We can actually show mathematically that it is exactly \((x,y) = (1,0)\), but let’s pretend that we do not know that.
\[ f(x,y) = - \sqrt{(x-1)^2 + y^2} \]
grid <- expand.grid(x=seq(-5,5,.1),y=seq(-5,5,.1))
grid$z <- -((grid[,1] - 1)^2 + grid[,2]^2)^.5
plot_ly(grid,
x = ~x,
y = ~y,
z = ~z,
type='contour',
colors = 'YlGnBu',
showscale=TRUE,
reversescale =T)
Every optimization process starts with a wild guess about where the minimum or maximum is. Starting from that first guess, we evaluate the gradient to update our guess. More formally, we can summarize the steepest ascent for this function as the following.
Step 0: Make a wild guess about where the maximum is.
\[ (x_0,y_0)\]
Step 1: Update your first estimate.
\[ (x_1,y_1) = (x_0,y_0) + \alpha f'(x_0,y_0)\]
Step 2: Continue updating your estimate until it no longer improves significantly.
\[ (x_{n+1},y_{n+1}) = (x_n,y_n) + \alpha f'(x_n,y_n)\]
\(f'(x_n,y_n)\) is the gradient vector at the \(n^{th}\) step and \(\alpha\) is known as the step size. While the gradient indicates which direction to go for the next step, \(\alpha\) indicates how big a step to take in that direction. In the machine learning literature, \(\alpha\) is called the learning rate.
Let’s implement this algorithm for our function. Suppose our initial estimate of the location of the maximum point is \((x,y) = (-2,2)\). Below is an R script using a while loop that iterates this procedure until one of two conditions is met: either the change in the function value is no longer positive (so the function stops improving), or the number of iterations reaches 500. I used an \(\alpha\) value of 0.1 at each step.
my.f <- function(x,y){
-sqrt((x-1)^2 + y^2)
}
f <- '-sqrt((x-1)^2 + y^2)'
iter <- data.frame(x=NA,y=NA,z=NA)
iter[1,c('x','y')] <- c(-2,2)
iter[1,]$z <- my.f(x=iter[1,1],y=iter[1,2])
change <- 1
i <- 1
while((change > 0) & (i < 500)){
der <- gradient(f = f,
var = c(x=iter[i,1],y=iter[i,2]))
i <- i + 1
iter[i,1:2] <- iter[i-1,1:2] + 0.1*der
iter[i,]$z <- my.f(x=iter[i,1],y=iter[i,2])
change <- iter[i,]$z - iter[i-1,]$z
#cat('Iteration :', i, 'Value :', round(iter[i,]$z,4),'\n')
}
iter
              x            y            z
1 -2.000000000 2.000000000 -3.605551275
2 -1.916794971 1.944529980 -3.505551275
3 -1.833589941 1.889059961 -3.405551275
4 -1.750384912 1.833589941 -3.305551275
5 -1.667179882 1.778119922 -3.205551275
6 -1.583974853 1.722649902 -3.105551275
7 -1.500769823 1.667179882 -3.005551275
8 -1.417564794 1.611709863 -2.905551275
9 -1.334359765 1.556239843 -2.805551275
10 -1.251154735 1.500769823 -2.705551275
11 -1.167949706 1.445299804 -2.605551275
12 -1.084744676 1.389829784 -2.505551275
13 -1.001539647 1.334359765 -2.405551275
14 -0.918334617 1.278889745 -2.305551275
15 -0.835129588 1.223419725 -2.205551275
16 -0.751924558 1.167949706 -2.105551275
17 -0.668719529 1.112479686 -2.005551275
18 -0.585514500 1.057009666 -1.905551275
19 -0.502309470 1.001539647 -1.805551275
20 -0.419104441 0.946069627 -1.705551275
21 -0.335899411 0.890599608 -1.605551275
22 -0.252694382 0.835129588 -1.505551275
23 -0.169489352 0.779659568 -1.405551275
24 -0.086284323 0.724189549 -1.305551275
25 -0.003079294 0.668719529 -1.205551275
26 0.080125736 0.613249509 -1.105551275
27 0.163330765 0.557779490 -1.005551275
28 0.246535795 0.502309470 -0.905551275
29 0.329740824 0.446839451 -0.805551275
30 0.412945854 0.391369431 -0.705551275
31 0.496150883 0.335899411 -0.605551275
32 0.579355912 0.280429392 -0.505551275
33 0.662560942 0.224959372 -0.405551275
34 0.745765971 0.169489352 -0.305551275
35 0.828971001 0.114019333 -0.205551275
36 0.912176030 0.058549313 -0.105551275
37 0.995381060 0.003079294 -0.005551275
38 1.078586089 -0.052390726 -0.094448725
As you see, we start from the initial estimate of (-2,2), and the steepest ascent algorithm takes us to a location of approximately (1, 0). Here it gets stuck between two points, (0.995, 0.003) and (1.079, -0.052), and would bounce back and forth between them without settling on a final solution. Let’s see how this journey to the top looks visually.
If we start from another location, it will still find the way to the top. Below is an example when the starting position is \((x,y) = (4,3)\).
For this example, it is straightforward for the steepest ascent algorithm to reach the top because the surface is not that complex and there is a single peak.
Now, let’s see how steepest ascent behaves on a more complicated surface. Consider the following function and its 3D representation. As you see, this function has a more complex surface. Instead of an obvious single peak, there are three of them. The highest peak (global maximum) is around the coordinates \((x,y,z) = (2.03,1.40,1.00)\). The second highest peak (local maximum) is around \((x,y,z) = (3.00,1.01,0.89)\), and the third highest peak (local maximum) is around \((x,y,z) = (0.34,1.43,0.41)\).
The code below implements the steepest ascent algorithm using the starting values \((x,y)=(0.1,0.3)\). After 134 iterations, the algorithm successfully finds the global maximum. As you will see, the path is not very straightforward, but the algorithm reaches the maximum point.
\[ f(x,y) = \sin\left(\frac{x^2}{2} - \frac{y^2}{4}\right) \times \cos(2x - e^y) \]
grid <- expand.grid(x=seq(-.5,3,.01),y=seq(-.5,2,.01))
grid$z <- sin(grid[,1]^2/2 - grid[,2]^2/4)*cos(2*grid[,1] - exp(grid[,2]))
plot_ly(grid,
x = ~x,
y = ~y,
z = ~z,
marker = list(color = ~z,
colorscale = 'YlGnBu',
showscale = FALSE)) %>%
add_markers() %>%
layout(scene = list(xaxis=list(range = c(3,-.5)),
yaxis=list(range = c(2,-.5))))
plot_ly(grid,
x = ~x,
y = ~y,
z = ~z,
type='contour',
colors = 'YlGnBu',
showscale=TRUE,
reversescale =T)
my.f <- function(x,y){
sin(x^2/2 - y^2/4)*cos(2*x - exp(y))
}
f <- 'sin(x^2/2 - y^2/4)*cos(2*x - exp(y))'
iter <- data.frame(x=NA,y=NA,z=NA)
iter[1,c('x','y')] <- c(0.1,0.3)
iter[1,]$z <- my.f(x=iter[1,1],y=iter[1,2])
change <- 1
i <- 1
while((change > 0) & (i < 500)){
der <- gradient(f = f,
var = c(x=iter[i,1],y=iter[i,2]))
i <- i + 1
iter[i,1:2] <- iter[i-1,1:2] + 0.1*der
iter[i,]$z <- my.f(x=iter[i,1],y=iter[i,2])
change <- iter[i,]$z - iter[i-1,]$z
#cat('Iteration :', i, 'Value :', round(iter[i,]$z,4),'\n')
}
iter
           x          y             z
1 0.1000000 0.30000000 -0.0071504205
2 0.1008912 0.29602763 -0.0069813460
3 0.1020187 0.29194132 -0.0067977154
4 0.1034053 0.28773472 -0.0065968698
5 0.1050763 0.28340116 -0.0063756112
6 0.1070600 0.27893368 -0.0061300776
7 0.1093883 0.27432503 -0.0058555844
8 0.1120970 0.26956758 -0.0055464208
9 0.1152266 0.26465337 -0.0051955882
10 0.1188231 0.25957407 -0.0047944609
11 0.1229389 0.25432090 -0.0043323444
12 0.1276340 0.24888465 -0.0037958946
13 0.1329773 0.24325564 -0.0031683534
14 0.1390484 0.23742368 -0.0024285336
15 0.1459393 0.23137804 -0.0015494662
16 0.1537570 0.22510747 -0.0004965909
17 0.1626265 0.21860025 0.0007746738
18 0.1726937 0.21184429 0.0023221945
19 0.1841297 0.20482739 0.0042215033
20 0.1971347 0.19753767 0.0065716441
21 0.2119437 0.18996440 0.0095028879
22 0.2288305 0.18209930 0.0131867005
23 0.2481124 0.17393880 0.0178481404
24 0.2701521 0.16548759 0.0237802259
25 0.2953554 0.15676428 0.0313582972
26 0.3241596 0.14781027 0.0410492716
27 0.3570064 0.13870292 0.0534050090
28 0.3942899 0.12957472 0.0690203203
29 0.4362666 0.12063880 0.0884268234
30 0.4829214 0.11221948 0.1118938916
31 0.5337922 0.10478034 0.1391400984
32 0.5877966 0.09893417 0.1690491491
33 0.6431526 0.09541006 0.1996126398
34 0.6975181 0.09495643 0.2283448756
35 0.7484210 0.09818703 0.2531358538
36 0.7938628 0.10542572 0.2730422868
37 0.8328014 0.11663056 0.2884357842
38 0.8652565 0.13143840 0.3004837631
39 0.8920441 0.14929465 0.3104729053
40 0.9143642 0.16959458 0.3194073566
41 0.9334547 0.19178414 0.3279282823
42 0.9503955 0.21540901 0.3363969055
43 0.9660403 0.24012515 0.3450046742
44 0.9810237 0.26568858 0.3538551580
45 0.9958002 0.29193711 0.3630122202
46 1.0106884 0.31877134 0.3725247450
47 1.0259118 0.34613811 0.3824382444
48 1.0416297 0.37401718 0.3928000166
49 1.0579605 0.40241117 0.4036614191
50 1.0749980 0.43133813 0.4150789479
51 1.0928223 0.46082616 0.4271148454
52 1.1115073 0.49090949 0.4398374997
53 1.1311254 0.52162552 0.4533216861
54 1.1517500 0.55301243 0.4676485993
55 1.1734573 0.58510701 0.4829055548
56 1.1963267 0.61794243 0.4991851800
57 1.2204405 0.65154579 0.5165838411
58 1.2458823 0.68593507 0.5351989725
59 1.2727353 0.72111559 0.5551248792
60 1.3010781 0.75707548 0.5764464948
61 1.3309802 0.79378038 0.5992305253
62 1.3624948 0.83116728 0.6235134460
63 1.3956495 0.86913761 0.6492860416
64 1.4304347 0.90755028 0.6764747143
65 1.4667896 0.94621528 0.7049207571
66 1.5045863 0.98488928 0.7343602923
67 1.5436142 1.02327483 0.7644094691
68 1.5835664 1.06102504 0.7945613095
69 1.6240339 1.09775519 0.8242012626
70 1.6645099 1.13306188 0.8526467339
71 1.7044101 1.16654851 0.8792105744
72 1.7431088 1.19785437 0.9032802740
73 1.7799881 1.22668286 0.9243962788
74 1.8144933 1.25282418 0.9423090654
75 1.8461814 1.27616884 0.9569988933
76 1.8747561 1.29670962 0.9686538400
77 1.9000796 1.31453271 0.9776153357
78 1.9221643 1.32980031 0.9843091842
79 1.9411477 1.34272861 0.9891805364
80 1.9572583 1.35356521 0.9926451205
81 1.9707810 1.36256891 0.9950607697
82 1.9820254 1.36999383 0.9967168636
83 1.9913019 1.37607831 0.9978362518
84 1.9989051 1.38103825 0.9985840146
85 2.0051032 1.38506402 0.9990787084
86 2.0101337 1.38832001 0.9994033960
87 2.0142020 1.39094581 0.9996151315
88 2.0174826 1.39305844 0.9997524891
89 2.0201218 1.39475498 0.9998412211
90 2.0222409 1.39611531 0.9998983471
91 2.0239400 1.39720472 0.9999350249
92 2.0253006 1.39807631 0.9999585224
93 2.0263891 1.39877308 0.9999735496
94 2.0272593 1.39932975 0.9999831465
95 2.0279544 1.39977425 0.9999892684
96 2.0285094 1.40012906 0.9999931702
97 2.0289525 1.40041217 0.9999956552
98 2.0293060 1.40063802 0.9999972370
99 2.0295880 1.40081815 0.9999982434
100 2.0298129 1.40096180 0.9999988834
101 2.0299923 1.40107633 0.9999992904
102 2.0301353 1.40116764 0.9999995491
103 2.0302493 1.40124044 0.9999997135
104 2.0303402 1.40129846 0.9999998180
105 2.0304126 1.40134471 0.9999998844
106 2.0304704 1.40138158 0.9999999266
107 2.0305164 1.40141096 0.9999999533
108 2.0305531 1.40143438 0.9999999704
109 2.0305823 1.40145305 0.9999999812
110 2.0306056 1.40146792 0.9999999880
111 2.0306242 1.40147978 0.9999999924
112 2.0306390 1.40148922 0.9999999952
113 2.0306508 1.40149675 0.9999999969
114 2.0306602 1.40150276 0.9999999981
115 2.0306677 1.40150754 0.9999999988
116 2.0306736 1.40151135 0.9999999992
117 2.0306784 1.40151438 0.9999999995
118 2.0306822 1.40151681 0.9999999997
119 2.0306852 1.40151873 0.9999999998
120 2.0306876 1.40152028 0.9999999999
121 2.0306896 1.40152149 0.9999999999
122 2.0306911 1.40152248 0.9999999999
123 2.0306923 1.40152324 1.0000000000
124 2.0306933 1.40152389 1.0000000000
125 2.0306941 1.40152435 1.0000000000
126 2.0306946 1.40152478 1.0000000000
127 2.0306952 1.40152504 1.0000000000
128 2.0306955 1.40152536 1.0000000000
129 2.0306959 1.40152547 1.0000000000
130 2.0306961 1.40152575 1.0000000000
131 2.0306964 1.40152572 1.0000000000
132 2.0306964 1.40152602 1.0000000000
133 2.0306967 1.40152584 1.0000000000
134 2.0306966 1.40152625 1.0000000000
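As a final sanity check, evaluating the function at the last iterate from the table confirms that we reached the global maximum value of 1 (base R, using the final coordinates printed above):

```r
my.f <- function(x, y) {sin(x^2/2 - y^2/4) * cos(2*x - exp(y))}

my.f(2.0306966, 1.40152625)  # very close to 1, the global maximum
```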